Cross-Platform Omics Prediction procedure: a statistical machine learning framework for wider implementation of precision medicine

In this modern era of precision medicine, molecular signatures identified from advanced omics technologies hold great promise to better guide clinical decisions. However, current approaches are often location-specific due to the inherent differences between platforms and across multiple centres, thus limiting the transferability of molecular signatures. We present Cross-Platform Omics Prediction (CPOP), a penalised regression model that can use omics data to predict patient outcomes in a platform-independent manner and across time and experiments. CPOP improves on the traditional prediction framework of using gene-based features by selecting ratio-based features with similar estimated effect sizes. These components gave CPOP the ability to have a stable performance across datasets of similar biology, minimising the effect of technical noise often generated by omics platforms. We present a comprehensive evaluation using melanoma transcriptomics data to demonstrate its potential to be used as a critical part of a clinical screening framework for precision medicine. Additional assessment of generalisation was demonstrated with ovarian cancer and inflammatory bowel disease studies.

• leaving out Dressman et. al. 11 data as this article was retracted; • swapping the TCGA microarray data 12 for the RNA-Seq data. And we further subset the RNA-Seq cohort to those at tumour stage of III and IV with first metastasis recurrence; and • adding one extra data, abbreviated as "Japan B" from Yoshihara et. al. 8 .
We focus on the 126 gene signature reported in Yoshihara et. al. 8 and select genes that are present in all nine datasets. This results in 94 genes with corresponding 4,371 log-ratio features. Samples in this ovarian data collection are described in Supplementary Table 2.

Inflammatory bowel disease data
This data is measured on a NanoString platform with genes originally selected to study disease-associated risk loci in inflammatory bowel disease (IBD) by Peloquin et. al. 13 . We try to classify all 983 samples as either inflamed or not inflamed learning from 712 genes. The original authors provided the raw NanoString data on Gene Expression Omnibus repository under the accession number GSE73094. We perform log2-transformation on the raw counts data. This IBD data is chosen as the experiment extends over a few years with obvious batch effect as the chemical reagent was changed twice. This change in the use of reagent creates three batches, IBD2 (n = 303), IBD3 (n = 295) and IBD4 (n = 385) that mimics the implementation challenge such prognostic (or risk) models will face when it is implemented in a prospective setting. Supplementary Fig. 13 shows the batch effect in this IBD via a sample boxplot and a principal component analysis (PCA) plot. Previous efforts in addressing the stability in data quality is through normalisation techniques, for example, in Molania et. al. 14 . CPOP distinguishes itself from normalisation techniques through the use of log-ratios and making predictions simultaneously.

Cross-Platform Omics Prediction (CPOP) methodology
Cross-Platform Omics Prediction (CPOP) is a procedure that enables sample prediction across gene expression datasets with different scales (e.g. different sample means). We will use the generic phrase of "scale difference" to encompass all situations where multiple gene expression data exhibit different scales in the data due to, for example, the use of different experimental instruments/platforms or drifts in measurements in a prospective setting. We use the term 'biomarker' and 'feature' interchangeably. We will use the term predictive in a statistical sense and the term predictive markers in a generic way referring to all forms of biomarkers whether they are diagnostic, prognostic or predictive. We use the term training set interchangeably with reference set (or sets), and restrict usage of the term test set or validation set to situations with known patient outcome i.e. to situations where we are assessing or comparing the performance of CPOP. We use the term test sample or validation sample when the unknown subjects are to be predicted.
A major consideration in developing CPOP is to make predictions on a single-sample without normalisation or combining it with additional data. The CPOP procedure has the following three key characteristics: 1. CPOP uses (log-)ratios of genes as biomarkers (features), which are more stable than using individual gene expression values (see Step 2 of CPOP).
2. CPOP uses Elastic Net models to perform feature selection using weights proportional to the stability of features across more than one data set. This allows the selection of common predictive markers (see Step 3 of CPOP).
3. CPOP selects for features with high similarity in their between-data estimated effects (Step 4 of CPOP).
The use of (log-)ratios is not a new idea, but its usage in the context of patient outcome prediction challenges the common practice of using only genes as features. The calculation of these (log-)ratios is intended to obtain a quantitative measure for the relative differences between original omics features (e.g. genes), which we have found to be better preserved across different patient cohorts and data generation platforms. A key assumption in using these (log-)ratios is that the omics platforms in question can unbiasedly estimate the relative expression level of features. The prefix of 'log' simply reflects the prevalence of the log-transformation in dealing with omics data in practice. It is known that log(A/B) = log(A) − log(B) for positive values of A and B, thus, these log-ratios are able to quantify the difference in expression levels of A and B, provided the data is available on a log scale. In practice, there will be situations when published processed data are not log-transformed but with some sensible transformation on the raw measurement values. In such a case, it is assumed that these transformations will still be able to provide a sensible quantification of the relative difference between the two features A and B (e.g. subtracting one feature's value from another). We acknowledge that the (log-)ratio transformation is not suitable to be applied on a whole genome scale directly (e.g. whole genome RNA-sequencing) due to the dimensionality associated with searching over all paired features but point out that this approach is achievable on targeted omics assays. These targeted omics assays typically provide a higher signal-to-noise ratio for candidate features that are of higher clinical relevance, and are in wide use in clinical validation, the translational work and in the implementation phase of precision medicine. We also note that this (log-)ratio construction is a surrogate for the grander concept of "relative difference between features" and is certainly not limited to just transcriptomics data (e.g. similar concept exist in protein expression).

Statistical Background
Suppose we have a gene expression data matrix X ∈ R n×p where n is the number of samples and p is the number of genes on a gene expression platform. We define the "log-ratio matrix" as a matrix Z of dimension R n×q where q = p 2 and each column of Z is the pairwise difference between two log-transformed columns in X. Formally, each column of Z is given by enumerating all log-ratio features log(x l ) − log(x m ) for 1 ≤ l < m ≤ p. Thus, each column in the Z matrix is the log-ratio of the expression values of two genes. For the given log-ratio matrix Z ∈ R n×q , we denote each row of the matrix as z i for patient i = 1, . . . , n.
Let y ∈ R n be a vector that measures each patient's clinical outcome (e.g. patient tumour shrinkage in millimetres or prognostic outcome).
The weighted Elastic Net (WEN) model is a regularised regression model that solves for a regression coefficient vector β ∈ R q with the optimisation equation: where λ ∈ (0, ∞] and α ∈ [0, 1] are tuning parameters and w = (w 1 , . . . , w q ) is a sequence of weights placed on each of the q features. This will be explained in context of our crossplatform prediction later. The first component of Equation (1) is a linear loss function that measures the difference between the fitted value z ⊤ i β and the observed response value y i for sample i. This component can be readily substituted with any appropriate non-linear loss function depending on the variable type of the response variables (e.g. in 'Performance evaluation' section, logistic and Cox models are used to deal with binary and survival responses, respectively). The second component of the equation is a penalty on the magnitude of the estimated regression coefficients β. This component mixes a L 1 -norm penalty and a L 2 -norm penalty through the use of the α tuning parameter. The λ tuning parameter controls the total strength of penalisation in the overall equation.

CPOP procedure
In the Main Figure 1A, CPOP is presented as a five-step procedure, with the first step being data selection. This step is often context-related and one should select datasets with similar clinical phenotypes, e.g. independent samples at the same cancer stage. We describe the rest of the CPOP procedure below: Step 1. Data selection: the first step of data selection is dependent on the research questions to be addressed and one should select data with similar and appropriate clinical outcomes of interest. For example, the selected cohort can consist of independent samples at the same cancer stage. In the rest of the procedure, we assume we have two gene expression data and the CPOP model training will aim to find features consistently predictive in both data.
Step 2. Log-ratio matrices construction: Suppose we have two gene expression data and the associated log-ratio matrices as Z 1 ∈ R n 1 ×q and Z 2 ∈ R n 2 ×q , where n 1 and n 2 are the samples sizes for the two datasets. We do not impose the restriction of paired samples across the two data, however, we assume the two data measure the same q log-ratio features or we restrict our modelling to the common q log-ratio features between two datasets. For both data, we also have a clinical outcome measurement, denoted as y 1 ∈ R n 1 and y 2 ∈ R n 2 associated with data 1 and 2 respectively.
Step 3. Selecting common predictive features: compute a sequence of non-negative weights w j , j = 1, . . . , q that measure the column-wise statistical concordance between Z 1 and Z 2 . Fit a WEN model for both (Z 1 , y 1 ) and (Z 2 , y 2 ) using w j , j = 1, . . . , q to obtain estimated regression coefficients β ∈ R q with a penalty parameter α ∈ (0, 1]. Note the superscript denotes these regression coefficients are in the first step of CPOP. Since WEN generates sparse estimates for α ̸ = 0, thus it also naturally selects features from our data as those features with non-zero estimates in β (1) 2,j ̸ = 0} that collects all non-zero features selected into both models in both data.
In this paper, we primarily focus on the use of mean-difference weights: w j = | mean(Z 1j ) − mean(Z 2j )| for each j = 1, . . . , q, whereas other choices are also available in our CPOP package.
Step 4. Selecting features with between-data stability: define Z 1,S (1) and Z 2,S (1) as the matrices that we obtain when subsetting Z 1 and Z 2 to only the features present in S (1) . Then, fit an unweighted ridge regression model (i.e. a WEN model with no weights and α = 0) onto (Z 1,S (1) , y 1 ) and (Z 2,S (1) , y 2 ) to obtain β 1 and β 2 . Define another feature set S (2) = {j| sign( β We also include an additional step by iteratively fitting ridge regression to both Z 1,S (2) and Z 2,S (2) and in each iteration, we update the feature set S (2) by removing all features that do not satisfy sign( β 2,j ). This removal of features using the ridge models means that the size of S (2) is non-increasing with each iteration. This iterative step terminates when there is no further reduction in the size of S (2) .
Step 5. Final model estimation: the final CPOP models are the unweighted ridge regression models fitted onto (Z 1,S (2) , y 1 ) and (Z 2,S (2) , y 2 ). We will refer to these models as β In some situations, S (1) in step 3 of CPOP might not select enough predictive features as some versions of the Elastic Net models have a tendency to only select one of many correlated features and ignoring the rest 15 . The most notable example of this is the Lasso 16 . To overcome this, we can enlarge this feature set by introducing an iterative component. This can be done by first calculating S (1) as described above and then removing these features from the log-ratio matrices to obtain Z 1,S (1)c and Z 2,S (1)c , where S (1)c is the set complement of S (1) . We can then fit WEN onto (Z 1,S (1)c , y 1 ) and (Z 2,S (1)c , y 2 ) and update the feature set by adding the selected features in each iteration into S (1) . The removal of selected features in future iterations means informative correlated features are more likely to enter the feature set. The size of S (1) is non-decreasing and empirically we find that 20 iterations are usually enough for the size of S (1) to stabilise.

Practical implementation
Imputation: One particular important issue is the handling of missing values. CPOP model training assumes two complete data. However, if certain genes cannot be measured in a test/validation data, then imputation on the gene-level data (X test ) will be necessary prior to the calculation of the log-ratio matrix Z test . We make no particular recommendation on imputation methods as there are many good methods in the statistical literature and preference can vary among practitioners. However, we highlight a special case of missingness in gene expression data, which is when a gene is not measured at all. This situation arises typically when a gene fails quality control and any numerical values are deemed as invalid.
In this case, we propose to use the non-missing gene values in the two gene-level training data of CPOP (X 1 and X 2 ) to impute on X test . While a variety of methods can be used, in the CPOP package we provide a function (impute cpop) that utilises the Lasso estimator from the glmnet package 16;17 to make this imputation. Supplementary Figure 14 extends the results in Main Figure 3a and assumed some of the genes in the TCGA data (used as validation) are missing at random. Under 100 simulations for each level of missingness, prediction values from the imputed TCGA data is compared to the prediction values from the complete TCGA data. According to this assessment, the imputation method is able to handle up to 10% of missing genes without significantly impacting on the correlation and concordance of prediction. Setting a CPOP prediction cut-off: Applying the CPOP model to any new sample of interest will produce a predicted value similar to that of a linear regression model. In order to produce a binary class prediction (e.g. good vs poor prognosis), a single cut-off independent of between-data scale differences is needed. Here, we will set a cut-off at 0 for the linear predicted response variable y test,i = z test,i β CPOP to produce binary class predictions. This cut-off is chosen for its ease of interpretation in the context of two common types of clinical variable modelling. In binary classification, a cut-off for this linear predicted response implies a cut-off at 0.5 for the predicted probability, see section on 'Performance evaluation'. This is a threshold which a sample is equally likely to be assigned to one of two binary classes. Similarly, in Cox regression model in survival analysis, this cut-off at 0 corresponds to a cut-off at 1 for the predicted hazard ratio, see section on 'Performance evaluation'. Here, a value great than 1 implies an at-risk sample. Though it should be noted, as with all statistical models, this cut-off is only sensible if the validation data is biologically and clinically similar to training data. This notion of a data-independent cut-off is closely related to the challenge of betweendata scale differences and forms a critical part in our evaluation. A data-adaptive choice of this cut-off on the linear prediction values, for example, using the median, could easily bias the result in assuming there is about half of high-risk incoming samples, a poor assumption in prospective testing. On the other hand, the popular area under the receiver operating characteristic curve (AUC-ROC 18 ) metric for binary classification and the survival concordance index (C-index 19 ) are not appropriate measures as they avoid making a cut-off and thus mask the scale differences in the data. Nonetheless, we choose to report on both of these metrics in this manuscript so comparisons may be made with other publications.

Performance evaluation
We propose an evaluation framework with an emphasis on producing between-data predictions that are robust to scale differences. Supplementary Fig. 10 summarises the evaluation framework we have. For the evaluations below, we choose to use the prediction values averaged between the two ridge models at Step 4 of the CPOP procedure. Here, we choose to focus on two most common response data seen in clinical studies: • Binary classification response variable (e.g. good vs poor prognosis) can be modelled using the (penalised) logistic regression loss function. In logistic regression, the probability for assigning a sample i can be written as p test,i = 1/[1 + exp(−y test,i )].
• Survival time response variable (e.g. recurrence-free survival) can be modelled using (penalised) Cox proportional hazard loss function. In Cox proportional hazard model, the hazard ratio can be written as h i (t) h 0 (t) = exp(Z test,i β CPOP ) = exp(y test,i ).

Evaluation metrics and settings
We consider three broad classes of performance metrics to capture survival performance, classification performance and concordance performance. Supplementary Fig. 10 summarises the various metrics we use in connection with the CPOP training and testing data. In our evaluation setting, the majority of the metrics can be calculated under 100 repeated 5-fold cross-validations.
Survival performance metrics. Where survival time is available in the test data: 1. C-index: we take the predicted values from either a CPOP or a Lasso model and fit a classical Cox regression model (i.e. without penalisation) together with age and gender. The C-index 19 of this Cox model is reported. The C-index is defined as: where t i and t j are survival times of sample i and j respectively and η i and η j are linear predicted values in the classical Cox regression model. Survival analysis is performed using the survival package 20 .
2. KM-plot: we create a binary split of predicted samples at the hazard ratio of 1. This binary split means we can construct a Kaplan-Meier 21 survival plot (KM-plot) using the survminer package 22 . The log-rank statistic associated with this KM-plot is also reported.
3. Log-rank test p-value: similar to the KM-plot evaluation above, we also compute the log-rank test p-value of the binary split of predicted sample classes.
Concordance performance metrics. We propose two additional statistics to measure between-data concordance. For a validation data, we may calculate • a re-substituted value, Z test β test and use this value as the "gold-standard" against • a prediction value Z test β.
In the application to melanoma data collection, we evaluate CPOP against the competing Lasso model, where we use a Lasso-Cox model (results shown in Main Fig. 2E) for feature selection and then these features are then fitted using a ridge-Cox regression model. That is, we choose β to be the ridge regression coefficients , β R , fitted using only features selected CPOP or the Lasso. Denoting a = Z test β test and b = Z test β R . This procedure ensures that we can make fair comparisons across two distinct feature selection methodologies. Thus, three statistics we use are: 1. Pearson's correlation coefficient between a and b, which measures the concordance between the between-data prediction and the within-data re-substitution value. A higher positive value implies a higher quality of between-data prediction. To visualise this evaluation, the CPOP models and the Lasso-Cox models are placed on the x-axes of Main Fig. 2E and the re-substituted prediction values are placed on the y-axes and the Pearson's correlation is calculated.
2. Identity distance between a and b, which is defined as: This metric is based on the simple geometric fact that for a given point (a, b) ∈ R 2 , the quantity 1 √ 2 (|a − b|) measures the perpendicular distance between the point to the identity line y = x. Hence, this identity distance can measure the average deviation between the within-data resubstitution values (a) and the between-data prediction values (b). A lower value implies a better agreement between the two quantities.
3. Concordance correlation coefficient between a and b, which is defined as: This metric is defined by Lin 23 and it is a standard metric for evaluating reproducibility between two vectors measuring the same quantity. Its value can be expressed colloquially as: 1 − Expected squared perpendicular deviation from the identity line Expected squared perpendicular deviation from the identity line, assuming independence between a and b .
We will refer to this metric as "concordance" to avoid confusion with the more popular Pearson's correlation coefficient which considers the least square regression line.
Classification performance metrics. For cases where we are interested in binary classification (e.g. good vs poor prognosis) and all metrics below can be repeated applied under the 100 repeated 5-fold cross validation setting.
1. AUC-ROC: we use the yardstick package 24 to compute the AUC-ROC. Though it should be noted that this is not the most appropriate metric, its construction masks the effect of data scale differences.
2. We create a binary split of predicted samples at the predicted probability of 0.5 (e.g. predicted good prognosis class vs predicted poor prognosis class). The predicted class and true class labels defined for each data makes up a confusion matrix with four categories: true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). The predicted binary class are evaluated using classical classification statistics and computed using the yardstick package 24 : (a) Precision of prediction, defined as: (b) Recall of prediction, defined as: (c) Balanced accuracy, defined as: (e) F 1 statistic, defined as: All of these metrics range from 0 to 1. In the application to melanoma data collection (results shown in Main Fig. 3B), we use the CPOP model to make prediction on the MIA-validation 46 samples (reduced from five batches each consist of 12 samples by removing technical replicates).

Evaluation on sex-imbalance in melanoma data
In melanoma, there is a sex-imbalance with males typically associated with more severe outcomes 25;26 . Given that the melanoma patients we considered in our study are mostly in Stage III, it is not surprising that we have observed more males in our study as shown in Table 1. In light of this imbalance due to biology, we evaluated the fairness of prediction results by calculating the equalised odds for the two sexes, males and females. Predicted outcomes are considered as fair if the sensitivities in the subgroups are close to each other (i.e. the equalised odds is close to 1). The group-specific sensitivities indicate the number of the true positives divided by the total number of positives in that group. This calculation is implemented in the fairness package 27 . For the CPOP model presented in Main Figure  3, for the TCGA dataset, the equalised odds for females and males are 1.000 and 0.957, respectively. For the Sweden dataset, the equalised odds for females and males are 1.000 and 0.987, respectively. Thus, we evaluated the CPOP algorithm to be fair between the two sexes and the prediction accuracy and outcome from our CPOP model has not been impacted by this sex imbalance.

Evaluation on ovarian data collection
We apply the CPOP procedure with a penalised Cox loss function on the Japan A 8 and Tothill 28 data as the dual training set 1 . This collection of ovarian data is heterogeneous ( Supplementary Fig. 11) in terms of survival times. Through careful data curation, we selected a subset of data where the range of survival times overlap between multiple data, however, the degree of separation by survival status still varies from data to data. Thus, this heterogeneity still impacts our analytics.

Evaluation on inflammatory bowel disease data
Treating IBD2 and IBD3 (raw data) as the training set, we apply the CPOP procedure with a penalised logistic loss function on the inflammation status of the samples. The selected features are then refitted back on IBD2 and IDB3 separately using ridge regression. These ridge regression models are then used to predict on the inflammation status of samples in IBD4 (Main Fig. 4C).

Supplementary Discussion
Additional remarks on scale differences between datasets Given two arbitrary omics data measuring the same set of features, there is a very small chance that these two data will have equal scale. Main Fig. 1B provides an illustration showing boxplots of five melanoma gene expression data with four clear differences in scale. This is because omics features are typically measured on a relative scale with unit-less numerical values that are proportional to some molecular units. Technical batch effect across omics data of different origins is a classical example of this inconsistency and its presence in data is a key reason as to why omics data of different origins cannot be readily combined for the implementation of a clinical prediction work. Failing to correct for this scale difference in the data can produce misleading interpretation in the final predicted values.
For example, in Supplementary Fig. 3, we compute a Lasso model from samples in MIA -Microarray, MIA -NanoString, TCGA Stage III samples 1 and Sweden 5 samples. Then, for the TCGA stage III samples and Sweden samples, we compute the re-substituted prediction values. These values, drawn on the y-axis, are considered as gold standard of what the estimated risk probabilities should be from independent data. The prediction values from MIA -Microarray-trained Lasso model and MIA -NanoString-trained Lasso model are then plotted on the x-axis and coloured. Clearly, depending which data is used in the training of the Lasso model, we may arrive at different scale in the prediction. This comparison is particularly illustrative of the challenge associated with prediction in the presence of gene scale difference, as Main Fig. 1C shows MIA -Microarray and MIA -NanoString share high concordance for matched samples and yet their prediction values can show a large amount of variation.
Existing statistical methods aim to resolve gene scale differences through normalisation which inevitably requires estimation of data specific scales (e.g. mean and median) across all patient samples and a way to combine the samples. This procedure is often restrictive, since in a prospective experiment, it is not clear how a single sample should be combined with an existing study cohort (e.g. samples from previous studies frozen as a reference bank). Combining a single-patient's data with different cohorts and performing normalisation could lead to different predicted values for the same patient depending on the choice of the cohort.
The most popular method for adjusting data scales between different datasets is through the use of normalisation and this type of method can be divided into two broad categories: z-score standardisation and between-data normalisation. We refer to z-score standardisation as a pre-processing step where each omics predictor is centred at 0 and its sample variance is scaled to 1. This procedure has been widely used 7;29;30;31;32;33 . On the other hand, betweendata normalisation aims to correct the statistical distributions of genes in validation data to be similar to that of training data. Rudy et. al. 34 provides an evaluation of nine different between-data normalisation methods and more recent updates on this kind of normalisation include Taroni et. al. 35 and Thompson et. al. 36 . Through the act of combining multiple data and calculating summary statistics like sample-wise mean, both of these methods introduce interdependencies between the samples and may not be suitable for processing and predicting on a single omics profile in Criterion 15 of McShane et. al. 37 . Though withinsample standardisation methods such as in Le Cao et. al. 38 have the potential to bypass some of these constraints, in this manuscript, we will operate under the assumption that re-normalisation for incoming samples together with existing cohort is not practical due to constraint with consent, data security or other reasons.
Data normalisation alone is not enough to address the various challenges in implementing precision medicine utilising omics data with additional considerations relating to the model stability and reproducibility also being necessary. We will demonstrate this effect using a popular normalisation method, ComBat 39 . We first took the log-ratio matrices of MIA-Microarray, MIA-NanoString and TCGA from Supplementary Table 1 and performed the ComBat normalisation, available through the sva package in R. See Supplementary Figure  12 for the sample-wise boxplots before and after the ComBat normalisation. The normalised data is then split back to their original respective sources. Lasso models are fitted on the MIA-Microarray-normalised and MIA-NanoString-normalised data and the TCGA data is used as the validation data for both Lasso models. The prediction values from both Lasso models are then compared against each other. The CPOP model in Main Figure 3 was used again to compare the prediction values between MIA-Microarray and MIA-NanoString. Note that in the case of CPOP, no normalisation was performed on the log-ratios. Supplementary Figure 13a shows the prediction values from the Lasso models exhibit clear bias away from the identify line. This comparison of prediction values was designed to give the normalised data an advantage, as the TCGA validation data was also used in the normalisation, which created data leakage in the between-data modelling. Despite this, we see that the Lasso prediction values exhibit bias away from the identity line (red line). We compare this against Supplementary Figure 13b, where the CPOP prediction values are shown to have much lower bias. Thus, we see that the instability of prediction is not always fully addressed by normalisation.
To confirm that the observed bias effect is not due to random seeds in computation, 100 bootstrap samplings were performed on the samples of the normalised and unnormalised data with the Lasso and the CPOP model separately. The concordance and correlation between the predicted values are used to summarise the presence of bias as shown in the boxplots in Supplementary Figure 13C and Supplementary Figure 13D, respectively.
While there are numerous modelling approaches we could have taken, for example, rankbased approach such as Eddy et. al. 40 and Afsari 41 or tree-based approach like Breiman et. al. 42 and Ishwaran et. al. 43 ; we ultimately opt for a regression-based modelling approach because we want to efficiently utilise all gene expression information rather than summarising into ranks and wish to place a greater emphasis on model/feature interpretability. On the y-axis, we plot the re-substituted Lasso models where we trained and tested on the TCGA data and Sweden. As the re-substituted values (y-axis) are identical between data platform, we can then fairly compare the scale of prediction values between the MIA-Microarray-trained model and the MIA-NanoString-trained model (x-axis) and see an obvious scale difference between the two. Furthermore, both of these between-data prediction values are on a very different centering/scale to the re-substituted ("gold-standard") values from TCGA and Sweden.

Supplementary
Supplementary Figure 4: We perform 20 repeated 5-fold cross-validation to select 100 different sets of features where each CPOP and Lasso feature set is constructed from 80% of MIA-Microarray and MIA-NanoString samples. We compare the CPOP and Lasso predicted probabilities performance (light blue and light green) in for TCGA and Sweden in panel a and b respectively. We also add the within-data re-substituted values for CPOP and Lasso for reference (darker blue and darker green). While within-data performance of CPOP and Lasso are similar, CPOP consistently perform better than Lasso in terms of between-data performance.
Supplementary Figure 5: Assessment of prediction model with missing values. We introduce missing values into the TCGA data by randomly removing 5 genes and use the training model to impute the missing genes before calculating the log-ratio matrices and corresponding CPOP prediction. The process randomly removing of 5 genes is repeated 100 times. Panel a illustrates the prediction performance using five difference performance metrics (balance accuracy, F1, MCC, precision and recall). The red points represent the prediction performance on the complete TCGA without any missing values. Panel b shows the ROC curves for the prediction performance on imputed data (black line) where we predict on the TCGA data with 5 genes randomly removed and the complete TCGA data (red line).
Supplementary Figure 6: Assessment of the impact of missing features in predicting TCGA data. Assuming missing features at random, the described imputation method is able to handle up to approximately 20 of the 163 (12%) of missing columns without significantly impacting on the correlation and concordance of prediction. Kaplan-Meier plots show a significant difference between the predicted good (blue line) and poor (orange line) prognostic classes on different training-testing pairs from four of the melanoma data collection (MIA-Microarray, MIA-NanoString, TCGA and Sweden). Here, we show that the general applicability of the CPOP procedure through KM-plots of all 24 combinations of training-testing set of the MIA-Microarray, MIA-NanoString, TCGA and Sweden data. While not every training and testing combination shows a statistical difference, a majority (19 out of 24, or 79%) of pairs did.
Supplementary Figure 9: a Quartile plot of unnormalised log-ratios across three melanoma data. b Quartile plot of ComBat normalised log-ratios across the same three melanoma data. c Comparing CPOP prediction values on the TCGA unnormalised data. Each point represents a TCGA sample. d Comparing Lasso prediction values on the TCGA Combat normalised data. We note the increased bias compared to panel A. e and f Under 100 bootstrap sampling on the MIA-NanoString and MIA-Microarray data (i.e. the training data), we repeatedly calculate the concordance and correlation between the prediction values for the TCGA data, respectively. The Wilcoxon rank sum test p-value for the concordance and correlation metrics are 1.2e-15 and <2e-16 respectively. Supplementary Figure 10: Schematic illustration of the evaluation framework for CPOP using three different classes of metrics.
Supplementary Figure 11: Distribution of the survival time for the Ovarian cancer data collection, stratified by survival status. All data illustrated here uses overall survival time.
Supplementary Figure 12: Assessment of the ovarian data with survival performance metrics. Kaplan-Meier plot of CPOP prediction on all the nine ovarian data showing the survival probability between the predicted good (blue line) and poor (orange line) prognostic classes. We build the model using the CPOP procedure with a penalised Cox loss function on the Japan A 8 and Tothill 28 data. Five out of the nine data show a statistical significance between good and poor predicted outcomes. This illustrates the high predictive strength of the CPOP method. While not all data achieved statistical significance as the original publication 7 , it should be noted that z-score standardisation (standardisation with mean equal to 0 and variance equial to 1) is applied on all data prior to modelling in 7 . On the other hand, our CPOP evaluation avoids this transformation for the reason that we wish to best evaluate our method in a single-sample prediction situation where z-transformation is not impossible.
Supplementary Figure 13: a Quartile plot of gene expression values for the IBD data, coloured by the reagent codeset. Each sample is represented by its median (a single solid point), the 25% quantile of the gene/features (the lower end of a vertical line) and 75% quantile of the gene/features (the upper end of a vertical line). b PCA plot using the 3rd and 4th components, coloured by the reagent codeset.

Supplementary Tables
• Supplementary Table 1